-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: avro-sdc #18
feat: avro-sdc #18
Conversation
Additional info in _sdc columns header_skip, footer_skip, override_headers
README.md
Outdated
| quote_character | False | " | The character used to indicate when a record in a CSV contains a delimiter character. | | ||
| header_skip | False | 0 | The number of initial rows to skip at the beginning of each delimited file. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be just for delimited files right? same for delimited, header_skip, footer_skip, and override_headers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should only be for delimited files. I've prepended delimited_
to each name to make this more clear.
| delimiter | False | detect | The character used to separate records in a CSV/TSV. Can be any character or the special value `detect`. If a character is provided, all CSV and TSV files will use that value. `detect` will use `,` for CSV files and `\t` for TSV files. | | ||
| file_type | False | delimited | Can be any of `delimited`, `jsonl`, or `avro`. Indicates the type of file to sync, where `delimited` is for CSV/TSV files and similar. Note that *all* files will be read as that type, regardless of file extension. To only read from files with a matching file extension, appropriately configure `file_regex`. | | ||
| compression | False | detect | The encoding to use to decompress data. One of `zip`, `bz2`, `gzip`, `lzma`, `xz`, `none`, or `detect`. If set to `none` or any encoding, that setting will be applied to *all* files, regardless of file extension. If set to `detect`, encodings will be applied based on file extension. | | ||
| additional_info | False | 1 | If `True`, each row in tap's output will have two additional columns: `_sdc_file_name` and `_sdc_line_number`. If `False`, these columns will not be present. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defaulting to True I think makes more sense here. Does it apply for all file types though? Like avro I'm not sure it does 🤷♂️ we'd have to force some kind of implementation for each stream type maybe some can just return n/a or something if it doesn't make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It already does default to True. required=False
, default=1
. That's just how --about format=markdown
does it.
I think it applies to all file types, or at least all file types we have so far. For avro it just increments _sdc_line_number
for each record in a file and resets the count for each new file.
tap_file/streams.py
Outdated
header_skip = self.config["header_skip"] | ||
footer_skip = self.config["footer_skip"] | ||
|
||
for reader_dict in self._get_readers(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that the header and footer are being first ran through DictReader
right? I think this would cause issues for most of the use cases for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why. I've updated it to process delimited_header_skip
and delimited_footer_skip
before parsing with DictReader
.
Let's get this merged and conficts fixed, and then fix Pytest etc. Looks good to me |
Feature additions:
Closes #5
Closes #11
Closes #12